72 research outputs found

Approximately Minwise Independence with Twisted Tabulation

Author: A. Broder
A.Z. Broder
E. Cohen
M. Datar
M. Pǎtraşcu
R.E. Fan
Y. Bachrach
Publication date: 01/01/2014

A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S|=n$, and element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79] $\tilde O(1/u^{1/c})$-minwise hashing requires $\Omega(\log u)$-independence [Indyk SODA'99]. Pǎtraşcu and Thorup [STOC'11] had shown that simple tabulation, using same space and lookups yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner bypassing a complicated induction argument.

Comment: To appear in Proceedings of SWAT 201
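The minwise definition in the abstract above can be probed empirically. The following is a minimal sketch, not the paper's construction: it uses plain simple tabulation (XOR of one table lookup per key character) rather than twisted tabulation, and the names `make_simple_tabulation` and `estimate_min_probability`, the 8-bit characters, and the trial count are all illustrative assumptions. It estimates the probability that a fixed element receives the minimum hash value in a set and compares it with the ideal 1/n.

```python
import random


def make_simple_tabulation(c=2, char_bits=8, out_bits=32, rng=random):
    """Build h(x) that XORs c random-table lookups, one per character of x.

    Assumes keys fit in c * char_bits bits (an illustrative choice).
    """
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        value = 0
        for i in range(c):
            # Extract the i-th character of the key and look it up.
            value ^= tables[i][(x >> (i * char_bits)) & mask]
        return value

    return h


def estimate_min_probability(S, x, trials=2000, seed=0):
    """Estimate Pr[h(x) = min h(S)] over freshly drawn hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_simple_tabulation(rng=rng)
        if h(x) == min(h(y) for y in S):
            hits += 1
    return hits / trials


S = list(range(100, 110))
p = estimate_min_probability(S, x=S[0])
print(f"estimated {p:.3f}, ideal {1 / len(S):.3f}")
```

The gap between the estimate and 1/n is an empirical stand-in for the bias $\varepsilon$; with only a few thousand trials it mostly reflects sampling noise, not the asymptotic bounds stated in the abstract.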

arXiv.org e-Print Archive

CiteSeerX

Crossref

Copenhagen University Research Information System

On the k-Independence Required by Linear Probing and Minwise Independence

Author: A. Pagh
A.Z. Broder
E. Cohen
J.P. Schmidt
M.N. Wegman
P. Indyk
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Crossref

Comparison of Failures and Attacks on Random and Scale-Free Networks

Author: A.Z. Broder
B. Bollobás
D.S. Callaway
M. Molloy
M.E.J. Newman
P. Crucitti
P. Erdös
R. Albert
R. Cohen
R. Pastor-Satorras
S.N. Dorogovtsev
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2004

It appeared recently that some statistical properties of complex networks like the Internet, the World Wide Web or Peer-to-Peer systems have an important influence on their resilience to failures and attacks. In particular, scale-free networks (i.e. networks with power-law degree distribution) seem much more robust than random networks in case of failures, while they are more sensitive to attacks. In this paper we deepen the study of the differences in the behavior of these two kinds of networks when facing failures or attacks. We moderate the general affirmation that scale-free networks are much more sensitive than random networks to attacks by showing that the number of links to remove in both cases is similar, and by showing that a slightly modified scenario for failures gives results similar to the ones for attacks. We also propose and analyze an efficient attack strategy against links.

CiteSeerX

Crossref

HAL-Polytechnique

Dictionary matching in a stream

Author: A.V. Aho
A.Z. Broder
D. Breslauer
D.E. Knuth
M. Crochemore
M. Ružić
R. Clifford
R.M. Karp
Publication date: 01/01/2015

We consider the problem of dictionary matching in a stream. Given a set of strings, known as a dictionary, and a stream of characters arriving one at a time, the task is to report each time some string in our dictionary occurs in the stream. We present a randomised algorithm which takes O(log log(k + m)) time per arriving character and uses O(k log m) words of space, where k is the number of strings in the dictionary and m is the length of the longest string in the dictionary.
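To make the problem statement concrete, here is a sketch of the classical Aho-Corasick automaton driven one character at a time. This is a deterministic baseline, not the randomised algorithm of the abstract: it stores the whole dictionary, so its space is proportional to the total dictionary length rather than O(k log m). The class name `StreamMatcher` and its `feed` method are hypothetical, and dictionary strings are assumed to be non-empty.

```python
from collections import deque


class StreamMatcher:
    """Report dictionary words ending at each arriving character."""

    def __init__(self, dictionary):
        self.goto = [{}]   # trie transitions per node
        self.fail = [0]    # failure links (longest proper suffix in the trie)
        self.out = [[]]    # words ending at each node, including via suffixes
        for word in dictionary:
            node = 0
            for ch in word:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(word)

        # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes
        # inherit their link from the parent's failure chain.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)
        self.state = 0

    def feed(self, ch):
        """Consume one stream character; return the words that end here."""
        while self.state and ch not in self.goto[self.state]:
            self.state = self.fail[self.state]
        self.state = self.goto[self.state].get(ch, 0)
        return list(self.out[self.state])


matcher = StreamMatcher(["he", "she", "his", "hers"])
for position, ch in enumerate("ushers"):
    for word in matcher.feed(ch):
        print(position, word)
```

Each call to `feed` keeps only the current automaton state between characters, which is what makes the matcher usable on a stream.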

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

Counting approximately-shortest paths in directed acyclic graphs

Author: A.Z. Broder
B. Lu
C. Burge
D. Naor
D. Štefankovič
J.M. Buhmann
L.G. Valiant
M. Dyer
M. Jerrum
T. Chen
Publication date: 01/01/2013

Given a directed acyclic graph with positive edge-weights, two vertices s and t, and a threshold-weight L, we present a fully-polynomial time approximation-scheme for the problem of counting the s-t paths of length at most L. We extend the algorithm for the case of two (or more) instances of the same problem. That is, given two graphs that have the same vertices and edges and differ only in edge-weights, and given two threshold-weights L_1 and L_2, we show how to approximately count the s-t paths that have length at most L_1 in the first graph and length at most L_2 in the second graph. We believe that our algorithms should find application in counting approximate solutions of related optimization problems, where finding an (optimum) solution can be reduced to the computation of a shortest path in a purpose-built auxiliary graph.
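The counting problem itself can be illustrated with an exact dynamic program. This sketch is not the approximation scheme of the abstract: it assumes positive integer edge weights and runs in time that grows with the threshold L (pseudo-polynomial), which is the cost an approximation scheme is meant to avoid. The function name `count_paths_within` and the edge-list input format are illustrative assumptions.

```python
from functools import lru_cache


def count_paths_within(edges, s, t, L):
    """Count s-t paths of total weight at most L in a DAG.

    edges: iterable of (u, v, weight) with positive integer weights.
    """
    adjacency = {}
    for u, v, w in edges:
        adjacency.setdefault(u, []).append((v, w))

    @lru_cache(maxsize=None)
    def count(u, budget):
        # Reaching t with a non-negative remaining budget is one valid path.
        total = 1 if u == t else 0
        for v, w in adjacency.get(u, []):
            if w <= budget:
                total += count(v, budget - w)
        return total

    return count(s, L)


edges = [("s", "a", 1), ("s", "b", 2), ("a", "t", 1),
         ("b", "t", 2), ("a", "b", 1)]
print(count_paths_within(edges, "s", "t", 3))  # paths of length at most 3
```

In this small graph the three s-t paths have lengths 2, 4 and 4, so the count changes as the threshold moves from 3 to 4.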

arXiv.org e-Print Archive

Maastricht University Research Portal

Crossref

Scalable Mining of Common Routes in Mobile Communication Network Traffic Data

Author: A.Z. Broder
C. Song
D.J. Patterson
G. Yavas
J. Hightower
K. Laasonen
L. Liao
M.C. González
T. Sohn
W. Massey
W. Rand
Publication date: 01/01/2012

A probabilistic method for inferring common routes from mobile communication network traffic data is presented. Besides providing mobility information, valuable in a multitude of application areas, the method has the dual purpose of enabling efficient coarse-graining as well as anonymisation by mapping individual sequences onto common routes. The approach is to represent spatial trajectories by Cell ID sequences that are grouped into routes using locality-sensitive hashing and graph clustering. The method is demonstrated to be scalable, and to accurately group sequences using an evaluation set of GPS tagged data.

Crossref

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Summary cache: a scalable wide-area Web cache sharing protocol

Author: A.Z. Broder
J. Almeida
Li Fan
Pei Cao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Cross-language high similarity search using a conceptual thesaurus

Author: A. Chowdhury
A.Z. Broder
D. Pinto
J. Dean
M. Anderka
M. Potthast
M.S. Charikar
P. Mcnamee
P.F. Brown
R. Steinberger
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.

This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and it has been partially funded by the European Commission as part of the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework, and by the Text-Enterprise 2.0 research project (TIN2009-13391-C04-03). The research work of the second author is supported by the CONACyT 192021/302009 grant.

Gupta, P.; Barrón Cedeño, LA.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8

Crossref

RiuNet

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Incremental Web-Site Boundary Detection Using Random Walks

Author: A. Alshukri
A.Z. Broder
B. Liu
D. Gomes
E.M. Rodrigues
I.H. Witten
J. Han
J. Pokorn
K. Bharat
L. Lovász
M.H. Dunham
P. Dmitriev
P. Senellart
R. Aleliunas
R. Kumar
S. Abiteboul
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Crossref

Analysis of Agglomerative Clustering

Author: A.Z. Broder
Christian Sohler
Daniel Kuntze
F. Pereira
Johannes Blömer
K. Florek
K. Lee
L.L. McQuitty
M. Bādoiu
M. Charikar
M. Fréchet
M. Naszódi
M.B. Eisen
Marcel R. Ackermann
P.H.A. Sneath
R. Webster
S. Dasgupta
T. Feder
T.F. Gonzalez
W.B. Johnson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 07/03/2014

The diameter $k$-clustering problem is the problem of partitioning a finite subset of $\mathbb{R}^d$ into $k$ subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem (for all values of $k$) is the agglomerative clustering algorithm with the complete linkage strategy. For decades, this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper, we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension $d$ is a constant, we show that for any $k$ the solution computed by this algorithm is an $O(\log k)$-approximation to the diameter $k$-clustering problem. Our analysis does not only hold for the Euclidean distance but for any metric that is based on a norm. Furthermore, we analyze the closely related $k$-center and discrete $k$-center problem. For the corresponding agglomerative algorithms, we deduce an approximation factor of $O(\log k)$ as well.

Comment: A preliminary version of this article appeared in Proceedings of the 28th International Symposium on Theoretical Aspects of Computer Science (STACS '11), March 2011, pp. 308-319. This article also appeared in Algorithmica. The final publication is available at http://link.springer.com/article/10.1007/s00453-012-9717-
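The complete linkage strategy analysed in the abstract above can be sketched in a few lines. This is a naive cubic-time illustration under the Euclidean distance, written for clarity and not taken from the paper; the function names `complete_linkage` and `max_diameter` are assumptions. At every step it merges the two clusters whose merged diameter (the largest pairwise distance between their points) is smallest, stopping when $k$ clusters remain.

```python
import math


def complete_linkage(points, k):
    """Agglomerative clustering with complete linkage, stopped at k clusters."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest points.
        return max(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


def max_diameter(clusters):
    """Objective of the diameter k-clustering problem for a given partition."""
    return max(
        max((math.dist(p, q) for p in c for q in c), default=0.0)
        for c in clusters
    )


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
result = complete_linkage(points, k=2)
print(result, max_diameter(result))
```

Recording the cluster list after every merge, instead of stopping at a fixed $k$, would give the full hierarchy of solutions for all values of $k$ that the abstract refers to.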

arXiv.org e-Print Archive

Crossref